1 Exploratory Data Analysis

Activity prompt: What are the first steps to take when you start a project or get ahold of a new data set?

Additional reading:

In a real data science project, you will typically have a question or idea brought to you: How are people interacting with the new version of my software? How is weather in Alaska changing in the last decade? What type of people are most likely to enroll in Obamacare?

You will sometimes be given a dataset when asked one of these questions, or more often, a general description of the data you could get if you talked to the right people or interacted with the right software systems.

In this section, we will talk about Exploratory Data Analysis (EDA), a name given to the process of 1) “getting to know” a dataset, and 2) trying to identify any meaningful insights within it. Grolemund and Wickham visualize the place of this “Understand” process with a simple diagram:

The process of EDA, as described by Grolemund and Wickham.

Figure 1.1: The process of EDA, as described by Grolemund and Wickham.

We view the process similarly:

  1. Understand the basic data that is available to you.
  2. Visualize and describe the variables that seem most interesting or relevant.
  3. Formulate a research question.
  4. Analyze the data related to the research question, starting from simple analyses to more complex ones.
  5. Interpret your findings, refine your research question, and return to step 4.

1.1 Understand the Basic Data

Start by understanding the data that is available to you. If you have a codebook, you have struck gold! If not (the more common case), you’ll need to do some detective work that often involves talking to people. At this stage, ask yourself:

  • Where does my data come from? How was it collected?1
  • Is there a codebook? If not, how can I learn about it?
  • Are there people I can reach out to who have experience with this data?

Next, you need to load the data and clean it. Once the data is loaded, ask yourself about each table:

  • What is an observation?
  • How many observations are there?
  • What is the meaning of each variable?
  • What is the type of each variable (date, location, string, factor, number, boolean, etc.)?

Some great methods to start with are the functions

  • glimpse() to learn about the numbers of variables and observations as well as the classes of variables
  • head() to view the top of the data table (can specify the number of rows with n= )
  • tail() to view the bottom of the data table

Here is an example with the Kaggle World Sustainability Dataset:

world_sust<-read_csv("http://faculty.olin.edu/dshuman/DS/WorldSustainabilityDataset.csv")%>%
    janitor::clean_names()

The data dictionary is here.

world_sust_15<-world_sust%>%
  filter(year==2015)%>%
  rename(population=population_total_sp_pop_totl,
         region_un=world_regions_un_sdg_definition,
         rural_pop=rural_population_percent_of_total_population_sp_rur_totl_zs,
         life_expectancy=life_expectancy_at_birth_total_years_sp_dyn_le00_in,
         internet=individuals_using_the_internet_percent_of_population_it_net_user_zs,
         elec_access=access_to_electricity_percent_of_population_eg_elc_accs_zs,
         gdp_per_capita=gdp_per_capita_current_us_ny_gdp_pcap_cd,
         compulsory_ed_years=compulsory_education_duration_years_se_com_durs,
         primary_enroll=school_enrollment_primary_percent_gross_se_prm_enrr,
         secondary_enroll=school_enrollment_secondary_percent_gross_se_sec_enrr,
         unemployment_male=unemployment_rate_male_percent_sl_tlf_uem_8_5_2,
         unemployment_women=unemployment_rate_women_percent_sl_tlf_uem_8_5_2,
         drinking_water_perc=proportion_of_population_using_basic_drinking_water_services_percent_sp_acs_bsrvh2o_1_4_1,
         below_poverty_line=proportion_of_population_below_international_poverty_line_percent_si_pov_day1_1_1_1,
         undernourishment=prevalence_of_undernourishment_percent_sn_itk_defc_2_1_1)%>%
  select(country_name,population,region_un,rural_pop,life_expectancy,internet,elec_access,gdp_per_capita,compulsory_ed_years,primary_enroll,secondary_enroll,unemployment_male,unemployment_women,drinking_water_perc,below_poverty_line,undernourishment)
glimpse(world_sust_15)
## Rows: 173
## Columns: 16
## $ country_name        <chr> "Aruba", "Angola", "Albania", "United Arab Emirate…
## $ population          <dbl> 104339, 27884380, 2880703, 9262896, 43131966, 2925…
## $ region_un           <chr> "Latin America and Caribbean", "Sub-Saharan Africa…
## $ rural_pop           <dbl> 56.892, 36.554, 42.566, 14.326, 8.497, 36.915, 75.…
## $ life_expectancy     <dbl> 75.72500, 59.39800, 78.02500, 77.28500, 76.06800, …
## $ internet            <dbl> 88.66123, 29.00000, 56.90000, 90.50000, 68.04306, …
## $ elec_access         <dbl> 100.00000, 42.00000, 99.98000, 100.00000, 99.81168…
## $ gdp_per_capita      <dbl> 28396.9084, 4166.9798, 3952.8012, 38663.4005, 1378…
## $ compulsory_ed_years <dbl> 13, 6, 9, 6, 14, 12, 11, 10, 10, 10, NA, 12, 6, 10…
## $ primary_enroll      <dbl> NA, 113.47796, 105.54298, 111.54276, 111.30341, 95…
## $ secondary_enroll    <dbl> NA, NA, 97.38848, NA, 106.94647, 86.00589, 108.230…
## $ unemployment_male   <dbl> NA, NA, 17.3, NA, NA, 17.4, NA, 6.0, 6.1, 4.1, NA,…
## $ unemployment_women  <dbl> NA, NA, 17.1, NA, NA, 19.2, NA, 6.1, 5.3, 5.9, NA,…
## $ drinking_water_perc <dbl> NA, NA, 93, NA, NA, 100, NA, 100, 100, 92, NA, 100…
## $ below_poverty_line  <dbl> NA, NA, 1.1, NA, NA, 1.3, NA, NA, 0.7, NA, NA, 0.1…
## $ undernourishment    <dbl> NA, 14.5, 4.7, 2.7, NA, 2.8, NA, NA, NA, NA, NA, N…
head(world_sust_15)
## # A tibble: 6 × 16
##   country_name         population region_un   rural_pop life_expectancy internet
##   <chr>                     <dbl> <chr>           <dbl>           <dbl>    <dbl>
## 1 Aruba                    104339 Latin Amer…     56.9             75.7     88.7
## 2 Angola                 27884380 Sub-Sahara…     36.6             59.4     29  
## 3 Albania                 2880703 Europe and…     42.6             78.0     56.9
## 4 United Arab Emirates    9262896 Northern A…     14.3             77.3     90.5
## 5 Argentina              43131966 Latin Amer…      8.50            76.1     68.0
## 6 Armenia                 2925559 Northern A…     36.9             74.5     59.1
## # ℹ 10 more variables: elec_access <dbl>, gdp_per_capita <dbl>,
## #   compulsory_ed_years <dbl>, primary_enroll <dbl>, secondary_enroll <dbl>,
## #   unemployment_male <dbl>, unemployment_women <dbl>,
## #   drinking_water_perc <dbl>, below_poverty_line <dbl>, undernourishment <dbl>
tail(world_sust_15)
## # A tibble: 6 × 16
##   country_name  population region_un          rural_pop life_expectancy internet
##   <chr>              <dbl> <chr>                  <dbl>           <dbl>    <dbl>
## 1 Venezuela, RB   30081827 Latin America and…      11.8            72.6     NA  
## 2 Vietnam         92677082 Eastern and South…      66.2            75.1     45  
## 3 Vanuatu           271128 Oceania                 75.0            69.9     22.4
## 4 South Africa    55386369 Sub-Saharan Africa      35.2            62.6     51.9
## 5 Zambia          15879370 Sub-Saharan Africa      58.1            61.7     NA  
## 6 Zimbabwe        13814642 Sub-Saharan Africa      67.6            59.5     22.7
## # ℹ 10 more variables: elec_access <dbl>, gdp_per_capita <dbl>,
## #   compulsory_ed_years <dbl>, primary_enroll <dbl>, secondary_enroll <dbl>,
## #   unemployment_male <dbl>, unemployment_women <dbl>,
## #   drinking_water_perc <dbl>, below_poverty_line <dbl>, undernourishment <dbl>

Finally, ask yourself about the relationships between tables:

  • What variables link the tables (i.e., which variables can you use in join commands)?

1.2 Visualize and Describe the Data

Once you have the data loaded and cleaned, it is usually helpful to do some univariate visualization; e.g., plotting histograms, densities, and box plots of different variables. You might ask questions such as:

  • What do you see that is interesting?
  • Which values are most common or unusual (outliers)?
  • Is there a lot of missing data?
  • What type of variation occurs within the individual variables?
  • What might be causing the interesting findings?
  • How could you figure out whether your ideas are correct?

Once you have done some univariate visualization, you might examine the covariation between different variables. One convenient way to do this is with a pairs plot.

Here are three different versions of such plots. The main point of these plots is not necessarily to draw any conclusions, but to help generate more specific research questions and hypotheses.

pairs(world_sust_15[,c(2,4:9)], bg="lightblue",panel=panel.smooth)

ggpairs(world_sust_15[,2:9], aes(alpha = 0.4))

lowerFn <- function(data, mapping, method = "lm", ...) {
  p <- ggplot(data = data, mapping = mapping) +
    geom_point(colour = "blue") +
    geom_smooth(method = method, color = "red", ...)
  p
}

ggpairs(
  world_sust_15[,2:9], lower = list(continuous = wrap(lowerFn, method = "lm")),
  diag = list(continuous = wrap("barDiag", colour = "blue")),
  upper = list(continuous = wrap("cor", size = 5))
)

1.3 Formulate a Research Question

You will often end up with a ton of data, and it can be easy to be overwhelmed. How should you get started? One easy idea is to brainstorm ideas for research questions, and pick one that seems promising. This process is much easier with more than one brain! You will often be working off of a broad question posed by your business, organization, or supervisor, and be thinking about how to narrow it down. To do so, you can again revisit questions like “What patterns do you see?” or “Why might they be occurring?”

2 Practice: Wastewater Treatement Plants

Wastewater treatment plants remove contaminents (pathogens, nutrients, organics, and other pollutants) from wastewater, and then typically return the treated water (an effluent) into surface waters such as streams, rivers, and wetlands. Here is an overview of how the Massachusetts sewer system works and more details on the major components of the Deer Island Wastewater Treamtement Plant. Here is a wastewater treatment overview video and its condensed version.

The HydroSHEDS project recently compiled data on 58,502 wastewater treatment plants into the HydroWASTE database. The authors published a 2022 article in Earth System Science Data describing the database. This data set was featured on TidyTuesday, a weekly data project where people from the R for Data Science (R4DS) online learning community each make a visualization from the same dataset and post them online. You should try to do TidyTuesday sometime.

Here are ten different tables for you to use for this activity:

ww_plants<-read_csv("http://faculty.olin.edu/dshuman/DS/hydrowaste/HydroWASTE_v10.csv")%>%
  janitor::clean_names()
north_america_river_atlas<-read_csv("http://faculty.olin.edu/dshuman/DS/hydrowaste/north_america_river_atlas.csv")
biome_codebook<-read_csv("http://faculty.olin.edu/dshuman/DS/hydrowaste/biome_codebook.csv")
climate_zone_codebook<-read_csv("http://faculty.olin.edu/dshuman/DS/hydrowaste/climate_zone_codebook.csv")
country_abbreviation_codebook<-read_csv("http://faculty.olin.edu/dshuman/DS/hydrowaste/country_abbreviation_codebook.csv")
freshwater_ecoregion_codebook<-read_csv("http://faculty.olin.edu/dshuman/DS/hydrowaste/freshwater_ecoregion_codebook.csv")
major_habitat_type_codebook<-read_csv("http://faculty.olin.edu/dshuman/DS/hydrowaste/major_habitat_type_codebook.csv")
potential_natural_veg_class_codebook<-read_csv("http://faculty.olin.edu/dshuman/DS/hydrowaste/potential_natural_veg_class_codebook.csv")
terrestrial_ecoregion_codebook<-read_csv("http://faculty.olin.edu/dshuman/DS/hydrowaste/terrestrial_ecoregion_codebook.csv")
wetland_class_codebook<-read_csv("http://faculty.olin.edu/dshuman/DS/hydrowaste/wetland_class_codebook.csv")

Exercise 2.1 (Explore the wastewater plant data) Here is the data dictionary for the HydroWASTE data in ww_plants. What does each row correspond to? Make sure to understand the different variables. Where in the world are the wastewater plants in the database? Which plants have effluents that flow into oceans or large lakes? How many of those are there? What is df? How is calculuated? What df levels are commonly considered as cause for environmental concern? All of these questions can be answered by skimming the journal article.

waste_ID: id num of the treatment plant QUAL_LOC: quality of the treatment location by a scale of 1-4 LAT/LON_OUT: outfall location longitude or latitude (outfall loc being where the discharge is discharged) QUAL_POP: DF is calculated as a ratio of natural discharge of the receiving waterbody to the plant’s effluent discharge. DF less than 10% is typically concerning. ::: {.exercise #unnamed-chunk-10 name=“Explore the North America river atlas”} The north_america_river_atlas table is extracted from the HydroRIVERS database.

:::

Exercise 2.2 (Possible joins)

  1. What variable(s) link the ww_plants and north_america_river_atlas tables?

  2. Understand each of the other five variables in north_america_river_atlas (besides hyriv_id and main_riv). What do they mean? Which other tables can they be joined with?

  3. Create a north_america_river_atlas_expanded table where each row (river reach) includes all of the information that can be joined from the other codebooks you’ve been given. It should have 16 variables (and still have 986463 observations).

  1. hyriv_id connects the two tables together.
  2. Each of the remaining variables indicate some sort of identifier, which I assume means they identify the type of climate, wetland, or water region it is.
north_america_river_atlas_expanded <- 
  north_america_river_atlas %>%
  left_join(wetland_class_codebook, by=c("wetland_class_id"="wetland_class_id")) %>%
  left_join(freshwater_ecoregion_codebook, by=c("freshwater_ecoregion_id"="freshwater_ecoregion_id")) %>%
  left_join(major_habitat_type_codebook, by="major_habitat_type_id") %>%
  left_join(climate_zone_codebook, by="climate_zone_id") %>%
  left_join(terrestrial_ecoregion_codebook, by=c("terrestrial_ecoregion_id"="terrestrial_ecoregion_id")) %>%
  left_join(potential_natural_veg_class_codebook, by=c("potential_natural_veg_class_code"="potential_natural_veg_class_code"))

Exercise 2.3 (Broad research question(s)) There is a ton of data here, and it can be easy to be overwhelmed. Try to focus your exploration a bit by identifying one or two broad research avenues. Here are some examples, some of which are taken from the article on HydroWASTE (feel free to use these or come up with your own):

  • Better understand the wastewater plants that have dilution factors under 10: Where are they? What level facilities do they tend to be? To which river networks do their outfall locations lead?

  • Are there large geographic areas (by population or land area) whose wastewater is not being effectively treated by plants in the database?

  • For which wastewater plants might an upgrade in technology deliver the biggest improvement of downstream water quality?

  • Policy analysis: Where should local regulations be established to limit the release of pollutants? How are past policy decisions evident in the data?

  • Where does most treated water go? To rivers? To oceans or large lakes? Back into a water treatment plant for drinkable water (full cycle reuse)?

All of these topics are closely related the United Nations’ Sustainable Development Goal 6: “Ensure availability and sustainable management of water and sanitation for all.”

The avenue that I will be exploring is the first one listed regarding the plants whose dilution factors are under 10. Where are these plants locationed whose dilution factors are under 10? How large are the populations they serve? Which rivers do these plants outfall to and are they correlated with each other?

Exercise 2.4 (Visualize and describe the data) Use some univariate and bivariate visualizations to start to explore the questions you identified:

  • What do you see that is interesting?
  • Which values are most common or unusual (outliers)?
  • Is there a lot of missing data?
  • What type of variation occurs within the individual variables?
  • What might be causing the interesting findings?
  • How could you figure out whether your ideas are correct?
under_10_df <- filter(ww_plants, df<10) # filtering by dilution factor 

count(under_10_df, qual_loc)
## # A tibble: 4 × 2
##   qual_loc     n
##      <dbl> <int>
## 1        1   763
## 2        2   845
## 3        3   190
## 4        4   734

After filtering by dilution factors under 10 and observing the data, it is seen that there is not a singular specific region of the world where the treatment plants have a dilution factor under 10. Rather, there seems to be a wide variety of countries that this applies to, including the Europe area, North America, and Asia. Out of the 58502 plants that are observed in the ww_plants data set, 2532 have a dilution factor under 10. In terms of quality of location, the majority of these plants are at a level 2 or 1, which sits at the high end. This means that the quality of the information regarding the WWTP’s location is more accurate compared to others in the database.

# Visualization of distribution of population served by plants with df under 10
under_10_df <-
  under_10_df %>%
  mutate(pop_bin= case_when(pop_served < 1000 ~ "under 1k", 
                            pop_served >= 1000 & pop_served < 10000 ~ "1k-10k",
                            pop_served >= 10000 & pop_served < 50000 ~ "10k-50k",
                            pop_served >= 50000 & pop_served < 100000 ~ "50k-100k",
                            pop_served >= 100000 ~ "over 100k"))

ggplot(under_10_df, aes(x=pop_bin))+
  geom_bar()+
  labs(x= "Population Served", y="Number of Plants", title="Count of Plants with DF under 10 by Population Size Served")

ggplot(under_10_df, aes(x=level))+
  geom_bar()+
  labs(x="Level of Treatment", y="Number of Plants", title="Number of Plants with DF under 10 by Level of Treatment")

ggplot(under_10_df, aes(x=status))+
  geom_bar()+
  labs(x="Level of Treatment", y="Number of Plants", title="Number of Plants with DF under 10 by Status of Operation")

ggplot(under_10_df, aes(x=pop_served, y=df))+
  geom_point()

ggplot(under_10_df, aes(x=waste_dis,y=river_dis))+
  geom_point()

ggplot(under_10_df, aes(x=design_cap, y=pop_served))+
  geom_point()

ggplot(under_10_df, aes(x=design_cap, y=river_dis))+
  geom_point()

countree <-
  under_10_df %>%
  count(country) %>%
  arrange(desc(n)) %>%
  head(5)

under_10_df_rivers <- 
  under_10_df %>%
  left_join(north_america_river_atlas_expanded, by=c("hyriv_id"="hyriv_id"))

filter(under_10_df_rivers, !is.na(main_riv))
## # A tibble: 671 × 41
##    waste_id source org_id wwtp_name country cntry_iso lat_wwtp lon_wwtp qual_loc
##       <dbl>  <dbl>  <dbl> <chr>     <chr>   <chr>        <dbl>    <dbl>    <dbl>
##  1    24984      2 1.00e9 ATHENS W… United… USA           34.8    -87.0        2
##  2    24991      2 1.00e9 BOAZ SLA… United… USA           34.2    -86.2        2
##  3    25046      2 1.00e9 HUNTSVIL… United… USA           34.7    -86.6        2
##  4    25241      2 1.20e9 JEFFERSO… United… USA           33.4    -87.0        2
##  5    25242      2 1.20e9 JEFFERSO… United… USA           33.6    -86.9        2
##  6    25243      2 1.20e9 JEFFERSO… United… USA           33.5    -86.9        2
##  7    25269      2 4.00e9 Alpine S… United… USA           33.8   -109.         2
##  8    25270      2 4.00e9 Little C… United… USA           33.9   -110.         2
##  9    25275      2 4.00e9 Bisbee  … United… USA           31.4   -110.         2
## 10    25281      2 4.00e9 Willcox … United… USA           32.2   -110.         2
## # ℹ 661 more rows
## # ℹ 32 more variables: lat_out <dbl>, lon_out <dbl>, status <chr>,
## #   pop_served <dbl>, qual_pop <dbl>, waste_dis <dbl>, qual_waste <dbl>,
## #   level <chr>, qual_level <dbl>, df <dbl>, hyriv_id <dbl>, river_dis <dbl>,
## #   coast_10km <dbl>, coast_50km <dbl>, design_cap <dbl>, qual_cap <dbl>,
## #   pop_bin <chr>, main_riv <dbl>, wetland_class_id <dbl>,
## #   freshwater_ecoregion_id <dbl>, climate_zone_id <dbl>, …
world<- map_data("world") # getting map of world 
  # mutate(fill = ifelse(under_10_df$country %in% countree$country, "red", "white"))

# library(leaflet)
# leaflet(data = under_10_df) %>% #base plot
#   addTiles() %>% #base map - default is openstreet map 
#   tileOptions(noWrap=FALSE) %>%
#   addCircles(lng = ~lon_wwtp,
#              lat = ~lat_wwtp, 
#              label = ~wwtp_name) 

ggplot(data=under_10_df, aes(x=lon_wwtp, y=lat_wwtp)) +
  geom_polygon(data = world, aes(x = long, y = lat, group = group), fill = "white", color = "black") +
  geom_point(size=0.2, aes(color=level)) +
  coord_map(xlim = c(-180,180)) +
  labs(x=element_blank(), y=element_blank(), title="Locations of WWTPs with DF under 10 colored by Treatment Level")

  # geom_point(aes(x=under_10_df$lon_wwtp, y=under_10_df$lat_wwtp, fill=under_10_df$pop_served))

Exercise 2.5 (Formulate a specific research question) Based on your preliminary visualizations and exploration of the date, formulate a more specific research question/hypothesis within your broad research question. The more you can iteratively narrow this question, the more interesting insights you might find. For example, you may want to hone in on a specific geographical region or wastewater plants with specific properties.

What factors may affect the treatment level of a WWTP with DF under 10?

Exercise 2.6 (Use visualizations to tell a story) Iteratively develop one to three visualizations that tell a story about your more specific research question/hypothesis. Note: the story may very well be something along the lines of “we thought variable X might be associated with pattern Y, but the evidence does not support that.”

ggplot(data=under_10_df, aes(x=lon_wwtp, y=lat_wwtp)) +
  geom_polygon(data = world, aes(x = long, y = lat, group = group), fill = "white", color = "black") +
  geom_point(size=0.2, aes(color=level)) +
  coord_map(xlim = c(-180,180)) +
  labs(x=element_blank(), y=element_blank(), title="Locations of WWTPs with DF under 10 colored by Treatment Level")

Looking at this map visualization of where each WWTP of each treatment level is in the world, it is noticeable that WWTP’s are clustered in types of treatment levels, so I wanted to look further into what factors may contribute to the clusters of each treatment level type.

count_pop_treatment <- 
  under_10_df %>%
  group_by(pop_bin, level) %>%
  summarize(n=n())

ggplot(count_pop_treatment, aes(x=level, y = n, fill = pop_bin))+
  geom_col(position="dodge")+
  labs(x= "Treatment Type", y="Number of Plants", title="Number of Plants with DF under 10 in each Treatment Level by Population Size Served")

I originally thought that perhaps the population of people being served by certain WWTPs would explain the type of treatment that a WWTP had. We can see that the vast majority of plants that serve over 100k people use secondary treatment level.

p1 <- ggplot(data=under_10_df, aes(x=lon_wwtp, y=lat_wwtp)) +
  geom_polygon(data = world, aes(x = long, y = lat, group = group), fill = "white", color = "black") +
  geom_point(size=0.2, aes(color=level)) +
  coord_map(xlim = c(-180,180)) +
  labs(x=element_blank(), y=element_blank(), title="WWTPs with DF <10 by Treatment Level")
p2 <- ggplot(data=under_10_df, aes(x=lon_wwtp, y=lat_wwtp)) +
  geom_polygon(data = world, aes(x = long, y = lat, group = group), fill = "white", color = "black") +
  geom_point(size=0.2, aes(color=pop_bin)) +
  coord_map(xlim = c(-180,180)) +
  labs(x=element_blank(), y=element_blank(), title="WWTPs with DF <10 by population served")

grid.arrange(p1, p2, nrow= 1)

I decided to compare the plants treatment levels and population served side by side to investigate further regarding the clustering if their locations. As you can see above, there is some slight correlation. In North America, the colors of both plots seem to match up pretty well, as well as in China’s general region. However, there seems to be a mix in Europe. This may possibly explain, to some extent, the reason for the treatment level for each plant’s location.


  1. Particularly important questions about how it was collected include whether it is a sample of a larger data set, and, if so, how the sampling was done? Randomly? All cases during a specific time frame? All data for a selected set of users? Answers to such questions strongly impact the conclusions you will be able to draw from the data.↩︎

  2. The graphics in this one are a bit more developed than you would really see in an exploratory analysis, but I think the progression of visualizations is interesting and follows an exploratory story.↩︎